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SUMMARY 

Supercomputer optimizations for a computational method of solving stochastic, multibody, dynamic 
programming problems are presented. The computational method is valid for a general class of optimal 
control problems that are nonlinear, multibody dynamical systems, perturbed by general Markov noise in 
continuous time, i.e., nonsmooth Gaussian as well as jump Poisson random white noise. Optimization 
techniques for vector multiprocessors or vectorizing supercomputers include advanced data structures, loop 
restructuring, loop collapsing, blocking, and compiler directives. These advanced computing techniques and 
supercomputing hardware help alleviate Bellman’s curse of dimensionality in dynamic programming 
computations, by permitting the solution of larger multibody problems. Possible applications include 
lumped flight dynamics models for uncertain environments, such as large scale and background random 
aerospace fluctuations. 


INTRODUCTION 

The objective of this research is to provide a general, highly optimized, computational treatment of 
stochastic optimal control applications in continuous time. Advanced computing techniques have been 
implemented so that stochastic dynamic programming algorithms can be used to solve larger optimal 
control problems than possible by ordinary computing methods. Optimization techniques will help alleviate 
Bellman’s curse of dimensionality , in that the computational and memory requirements grow exponentially 
as the dimension of the state space increases, limiting the size of the control problem that can be 
computed. Computer optimization techniques can help alleviate Bellman’s curse by permitting larger, but 
still hardware limited problems to be computed. Optimization consists of parallelization and vectorization 
methods to enhance performance on advanced computers, such as parallel processors and vectorizing 
supercomputers. Preliminary results for massively parallel processors are also presented. 

General Markov random noise in continuous time consists of two kinds, Gaussian and Poisson. 
Gaussian white noise, being continuous but nonsmooth, is useful for modeling background random 
fluctuations, such as turbulence and moderate environmental variations. Poisson white noise (its frequency 
spectrum is also flat like Gaussian noise), being discontinuous, is useful for modeling large random 
fluctuations, such as shocks, collisions, unexpected external events and large environmental changes. Our 
general feedback control approach combines the treatment of both linear and nonlinear (i.e., singular and 
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nonsingular) control through the use of small to moderate quadratic costs. The methods also handle 
deterministic and stochastic control in the same code, making it convenient for checking the effects of 
stochasticity on the application. Some actual applications are models of resources in an uncertain 
environments [16], [13], [8], 

The Markov, multibody dynamical system is illustrated in Figure 1 and is governed by the stochastic 
differential equation (SDE): 

dy(s) = F(y,$,u)ds + G(y,s)dW(s) + H(y,s)dP(s) , (1) 

with initial value y(t) = ar, 0 < t < s < tj, y(s) £ V y , u £ X> u , where y(s) is the m x 1 multibody state 
vector at time 5 starting at time f, u = u(y, s) is the n x 1 feedback control vector, F is the m x 1 
deterministic nonlinearity vector, W is the r-dimensional normalized Gaussian white noise vector, P is the 
independent g-dimensional Poisson white noise vector with jump rate vector [A{] gx i, G is an m x r diffusion 
coefficient array, and H is an m x q Poisson amplitude coefficient array. In a more general treatment, the 
Poisson jump amplitude can also be random. 

The control criterion is the optimal expected cost performance, 

F*(x,0 = min [IVffiAN [V[y, 5, u,P, W] | y(<) = x]] , (2) 


where the random total cost is 


V[y,i,u,P,W] = Jt ds C(y(s), s, u(y(s), s)) , (3) 

on the time horizon (t, £/). The instantaneous cost function C = G(x,t,u) is assumed to be at least a 
quadratic function of the control, 


C(x,f,u) = C 0 (x,f) + Cf(x,t)u + |u r G 2 (x,t)u . (4) 

C 2 is assumed to be positive definite, so that large controls are much more costly on a per unit basis. In 
addition, the dynamics in (1) are assumed to be linear in the controls, 

F(x, £, u) = F 0 (x,f) + Fi(x,f)u, (5) 

remaining nonlinear in the state variable x. 

The Bellman functional PDE of dynamic programming (or Hamilton-Jacobi- Bellman Equation), 

dV* 

o = + H[V'} 

r)V* 

= — + F^W* + iGG^x.t) : VV t F* 

+ vmx + HKx,*),*)-^,*)] 
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follows from the generalized ltd chain rule for Markov SDEs as in [7] and [16]. Here, U* is the optimal 
feedback control computed by constraining the unconstrained or regular control, 

Uj»(*,0 = + FjW) , (7) 

to the control set D u , under the assumption of positive definite quadratic costs. In general, the Bellman 
equation (6) is nonlinear with discontinuous coefficients due to the quadratic last term, 

(|U - U R ) r C 2 U* , in (6) and due to the compact relationship between the constrained, optimal control 

and the unconstrained, regular control, 

* 

Ui (x,f) = min[^ mBX . 1 -,max[7' nun . I ,^ H . i (x,<)]], (8) 

for i = 1 to n controls. Here, U^n is the minimum control constraint vector and U max is the maximum. 

As the constraint components are attained, the optimal control component U i , changes from the regular 
control component, U R ,i> to components of the constraints, U m i nm i or U maXt which in general are functions 
of state and time. In (6), the symbol (:) denotes the scalar matrix product A : B = YaLi Y^i AijBij, 
assuming B is symmetric. It is important to note that the principal equation, the Bellman equation (6), is 
an exact equation for the optimal expected value V and does not involve any sampling approximations 
such as the use of random number generators in simulations. 

As the number of state variables, m, increases, the spatial dimension rises, and computational 
difficulties are present that can compare to those of three-dimensional fluid dynamics computations. Thus 
there is a great need to make use of advanced-architecture computers, to use parallelization as well as 
vectorization, in order to solve larger state space systems. The Panel on Future Directions in Control 
Theory [6] stresses the importance of making gains in such areas as nonlinear control, stochastic control, 
optimal feedback control and computational methods for control. This paper is a report on our efforts to 
treat all of the above mentioned areas combined from the point of view of computational control. 

SYMBOLS 
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cost coefficients (eq. (4)) 

state mesh increment, time increment (eq. (9)) 
nonlinearity function coefficients (eq. (5)) 

Gaussian noise amplitude matrix (eq. (1)) 

Poisson noise jump amplitude (eq. (1), (6)) 

Hamiltonian for Bellman Equation (eq. (6)) 

indices for state mesh points (eq. (9), (15), (18)) 

state dimension, number of mesh points for each state (eq. (1), (9)) 

control space dimension (eq. (1)) 

Poisson noise vector (eq. (1)) 

Poisson noise dimension (eq. (1)) 

Gaussian noise dimension (eq. (1)) 
forward time variable (eq. (1)) 

backward time variable, final time, discrete time (eq. (3), (9)) 
control vector (eq. (1)) 

regular control, optimal control (eq. (7), (8)) 

control constraint vectors (eq. (8)) 

total cost, optimal expected total cost (eq. (3), (2)) 

Gaussian noise vector (eq. (1)) 

initial state vector, discrete state (eq. (2), (9)) 
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forward state variable (eq. (1)) 

component of Poisson jump rate vector (eq. (6)) 


THE BASIC COMPUTATIONAL PROCEDURE 


The integration of the PDE in (6) is backward in time, because V* is specified finally at the final time 
t - tj rather than at the initial time. A summary of the discretization in state and backward time is 
given below: 


[X&UX1 = [*ti + & ~ l )- DX i. 


m X 1 


j = [j»]mxi , where ji - 1 to Mi , 'for i = 1 to m ; 

s — > Tfc = tj - (k - 1) ■ DT , for fc = 1 to K ; 

v '(Xj.TO — v M ; «lv"](Xj,r l+ .) — n. , i 


2 JV 2 

where DX{ is the mesh size for state i and DT is the step size in backward time. 
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The numerical algorithm is a modification of the predictor corrector, Crank Nicolson methods for 
nonlinear parabolic PDEs in [5]. Modifications are made for control feedback, switch term optimization 
and delay term calculations. Derivatives and Poisson induced differences are approximated with an 
accuracy that is second order in the local truncation error 0 2 (DXi ), at all interior and boundary points. 
Even though the Bellman equation (6) is a single PDE, the process of solving it not only produces the 
optimal expected cost V*, but also the optimal expected feedback control law U . This is because the 
Bellman equation is a functional PDE, in which the computed regular control feeds back into the optimal 
cost and the optimal cost feeds back into regular control through its gradient. The nonstandard part of the 
algorithm is to decompose this tightly coupled analytical feedback system so that both the cost and the 
control can be calculated by successive iterations, such that each successive approximation of one quantity 
improves the next approximation of the other quantity. While our procedure may look superficially like a 
standard application of finite differences, it is not due to the nonstandard control features mentioned 
above. For these reasons, we are not aware of any other successful stochastic dynamic programming code 
that treats anywhere near the generality of applications that we treat and with the advanced computing 
techniques that we use, especially with regard to Poisson noise. Variations of this algorithm have been 
successfully utilized in [16] and [8]. Quadrat and his co-workers [1] discuss several algorithms for stochastic 
dynamic programming problems that admit stationary solutions, and describe an expert system for their 
solution. 


Prior to calculating the values, at the new ( k 1)** time step for k — 1 to K 1, the old 

values, Vjjt and are assumed to be known, with Vj 0 = V^. The algorithm begins with an 

convergence accelerating extrapolator (x) start 
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The extrapolated values are use to calculate updated values of the gradient DU, the second order 
derivatives DDV, the Poisson functional terms (V r * evaluated at (x + H)), the regular control UR, the 
optimal feedback control U*, and the spatial functional of the Bellman equation. These 

evaluations are used in the extrapolated predictor (xp) step: 


V : 
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which are then used in the predictor evaluation (xpe) step: 


y( x P e ) _ l/y(xp) 

j,fc+i j' fc+1 


+ v jJfc ) , 
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and continuing with other terms of the spatial functional H. The evaluated predictions are used in the 
corrector (xpec) step : 

(13) 


y(xpec ,7 + 1 ) _ ,, nT -j/(xpe, 7 ) 

l j.fc+i - U* + Ui ' K y k+ l 


for 7 = 0 to 7 max until the stopping criterion is met, with corrector evaluation (xpece) step : 

T/( x pece ,7 + 1 ) _ ln r<xpec ,7 + 1 ) „ > / 14 > 

V jJt+ I “ 2 x j ,fc+i + Kj.*)’ 

The stopping criterion for the corrections is formally derived from a comparison to a predictor corrector 
convergence criterion for a linearized, constant coefficient PDE. A robust mesh selection method is used to 
determine the stopping criterion, so that only a couple of corrections are needed, except at the first time 
step. The proper selection of the time to state mesh ratio guarantees that the corrections for the 
comparison equation converge, whether the Bellman equation is parabolic-like when the Gaussian noise is 
present or hyperbolic- like when there is no Gaussian noise. 


Current efforts are concentrated on implementing the code on the Alliant FX/ 8 , Cray X-MP/48, 
Cray 2S/4-128, and the Connection CM -2 for more general multi-state and multi-control applications. In 
order to implement the code for arbitrary state space dimension, a more flexible data structure is needed 
for the problem arrays, F, G and H , as well as for the solution arrays, V along with its derivatives and U. 

The advantages of the algorithm is that it 1) permits the treatment of general continuous time 
Markov noise or deterministic problems without noise in the same code, 2 ) maintains feedback control, 3) 
permits the cheap control limit to linear singular control to be found from the same quadratic cost code, 
and 4) produces very vectorizable and parallelizable code whose performance is described in the next 
section. 


ADVANCED SUPERCOMPUTER OPTIMIZATION 

The code for the algorithm has been developed and tested on three advanced architecture machines, 
the ACRF Alliant FX /8 vector multiprocessor at Argonne National Laboratory; the NCSA Cray X-MP/48 
and the NCSA Cray 2S/4-128 at the University of Illinois in Urbana; the massively parallel Connection 
Machine CM-2 at both the ACRF and NCSA. The Alliant FX/ 8 , with its superb concurrent outer, vector 
inner (CO VI) parallelizing compiler, is mainly used to test for the parallelization of the code. The Cray 
X-MP/48, noted for its very fast pipelined processing unit, is used for the testing of small and moderate 
size code (less than 1 MW, where MW denotes a megaword or one million words). As the number of states 
grows, the problem size grows exponentially, we have to make use of the huge internal memory (up to 128 
MW) of the Cray 2S/4-128 or large numbers of parallel processors on the Connection Machine CM- 2 . 

The present code under testing has been obtained from the three-state, three-control modification 
of Hanson’s two-state, two-control resource model [ 8 ]. Modifications have been made to the present code so 
that it can apply to arbitrary number of state variables and mesh points by just changing a few 
parameters, numbers of state variables m and mesh points M . 

Initial parallelization and vectorization of the algorithm were done by what might be called the 
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“Machine Computational Model Method,” i.e., tuning the code to optimizable constructs that are 
automatically recognized by the compiler, with the Alliant FX/8 vector multiprocessor [2] in mind. All 
inner double loops were reordered to fit the Alliant concurrent- outer, vector-inner ( COVT) model. All 
non- short single loops were made vector- concurrent. Short loops became scalar-concurrent only. Multiple 
nested loops were reordered with the two largest loops innermost. A total of 37 out of 39 loops was 
optimized for the two-state code, two-control model with Poisson noise. Detailed results are reported in [8], 
[9] and [10]. 

The relative performance of column oriented versus row oriented code is discussed in [111. 
Dongarra, Gustavson, and Karp [4] have demonstrated that loop reordering gives vector or supervector 
performance for standard linear algebra loops on a Cray 1 type column oriented FORTRAN environment 
with vector registers. However, for the stochastic dynamic programming application, the dominant loops 
are non-standard linear algebra loops, so that the preference for column oriented loops is not a rule, as 
demonstrated on the Alliant vector multiprocessor [11]. The present code under testing has up to four 
states and controls, with Gaussian as well as Poisson noise. This code is a general modification of the 
two-state, two-control model. 


Vector Data Structure 

In the original code, the data structure for the problem arrays, F and G, the solution arrays V, the 
derivative arrays, and the control arrays U, depend on all the numerical node indices, js(is ), for all state 
variables. The resulting data structure takes the form: 

F{is,js(l),js(2), ■ ■ ■ ,js(m)) (15) 

for each state equation, is ~ 1 to m, with the nonlinearity function used as an example. If it is assumed 
that there are a common number M — M\ — M 2 — • • * = M rn of nodes per state, then js(is) = 1 to M 
points for is = 1 to m states. As a consequence, the typically dominant loops for the computation of the 
nonlinearity function F, the solution gradient DV } and similar arrays, are nested to a depth of at least 
m + 1. A typical loop will take the form: 

do 1 i = 1, m 

do 1 jl = 1, M 


do 1 jm = 1, M 


1 F(i,jl,j2, ',jm) = 

This state size dependent loop nest depth level of m + 1 inhibits the development of general multibody 
algorithms, especially when the state size m is incremented and the number of loops in each nest has to be 
changed. Also, vectorization is inhibited for compilers that vectorize only the most inner loop. As the 
number of states grows, the computational load will grow asymptotically like some multiple of 

m ■ M m = (16) 

i.e., the load grows exponentially in the number of states m. The exponential growth in (16) is merely a 
quantitative expression of Bellman’s curse of dimensionality. 

One way around this inhibiting structure is to use a vector data structure [12]: 

FV(isJv) 
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to replace the original hypercube type of data structure in (15), using the nonlinearity vector as an example, 
such that all the numerical nodes are collected into a single vector indexed by the global state index jv , 
where jv = 1 to M Tn over all state nodes. 

TTL 

3 V = - !) • + 1. (18) 

t = l 

in the the case that the state mesh size, Mi has a common value of M for all i. 

Both the direct mapping from the original data structure to the vector data structure and the 
inverse mapping are needed to compute the amplitude functions, F, G and H , as well as the derivatives of 
V , because these quantities depend on the original formulation. The pseudo-inverse of the vector index in 
(18) can be shown to permit the recovery of the individual state indices by way of integer arithmetic; 

m 

jv) = 1 + [jv - 1 - U s (i‘>i v ) ~ 1) • (19) 

i=t*+l 

recursively, for is = m to 1, by back substitution, with = 0, as long as each state has the same 

number of discrete nodes M . The vector data structure of (17) to (19) results in major do loop nests of the 
order of 1 to 2, rather than order of m + 1. 

A typical vector data structure loop has the form 

do 2 i = 1, m ! parallel loop. 

do 2 jv = 1, M * *m ! vector loop. 

2 FV(ijv) = 

resulting in collapsing the loop nest depth from m + 1 to a depth 2, independent of the number of states m. 
This is analogous to the automatic compiler technique of loop collapsing on the Alliant for simple loops. 

Table I shows the performance of the code for m = 3 states and M ~ 16 nodes per state on the 
Alliant FX/8 at Argonne National Laboratory’s ACRF. implemented and rim on vector multiprocessors 
which will be discussed in the following two subsections. 

Parallelization in Alliant FX/8 

When loop 2 above is executed on multiprocessors such as the Alliant FX/8, due to the CO VI 
(concurrent-outer, vector-inner) compiler optimization scheme, the z-loop will rim in parallel while the 
jv- loop is vectorized. For machines with such architecture, the gain in speed is achieved through the full 
exploitation of all its processors. If the number states m is less than the maximum number of processors 
(the maximum number of processors is eight on the Alliant), performance ceases to improve beyond m 
processors as demonstrated in Table I when the number of processors p is greater than three. The speedup 
Sp.2 = T\ j/Tpj for loop 2 also levels off at roughly 2.9 in this table starting at p = 3. This means 
degradation in efficiency because a large proportion of processors available are sitting idle. 

One simple modification of the loop structure will solve the problem by parallelizing and 
vectorizing the entire loop nest, further enhancing the performance. This is illustrated by the restructuring 
of loop 2 by a compiler directive in loop 3 below. 

do 3 jv = 1, M * *m ! vector-concurrent loop. 

CVDSL NOVECTOR 
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3 


do 3 i = 1, m ! scalar loop. 

FV(i,jv) = 

Due to the flexibility of the optimization scheme of the FX Fortran compiler, we can choose whichever 
loop we want to parallelize and vectorize by inserting suitable compiler directives, such as CVD$L 
NOVECTOR in loop 3. The i-loop is moved innermost and is forced to run in Scalar Mode by inserting a 
CVD$L NOVECTOR directive. The modification has two effects: 

i. the outer jVloop is forced to run in Vector- Concurrent Mode , hence, full parallelization of the entire 

work load can be achieved through self- scheduling by the compiler; 

ii. moving the {-loop inner-most increases the chunk or grain size of each iteration, while overhead for 
parallelization and vectorization is lessened. 

The modification leads to an improvement of 46% of computing time for the code running by 8 
processors in the Alliant. 

Table II shows the performance of the modified code, for m — 3 states and M = 16 points per 
state, on the Alliant FX/8 at Argonne National Laboratory’s ACRF. The speedup, also given in the table, 
reaches a good value over six times executing on all eight Alliant processors using the form of loop 3. The 
last column compares the results of using loops 2 and 3 for the main stochastic dynamic programming 
loops and shows that the loop 3 form outperforms the loop 2 form by 1.85 times on all eight processors. 
Thus, the restructured loop 3 gives better load balancing that the pure vector data structure of loop 2. 

Parallelization on the Cray 2 

Parallelization in the Cray 2S/4-128 is done through multitasking. Basically, the compiler follows the 
COVI optimization scheme that the outer loop will run in parallel and inner loop is vectorized. In a 
multi-user environment such as that in NCSA, improvement through multitasking is hard to measure 
unless the code is run in a dedicated machine. Therefore, performance utilities such as Job Accounting (ja), 
are used to get an approximate measure of the CPU time and speed-up obtained. 

Table III shows the performance of multitasking on the NCSA Cray 2S/4-128. Note that the 
timings grow drastically as either the state dimension m and the common mesh size M increase. 

Performance on the Connection Machine 

As the number of states increase, the performance obtained from Cray shows an exponential growth as 
in Table IV. Thus for a larger size problem, another solution would be to implement the problem on a 
massively parallel computer system. The Connection Machine CM-2 at the NCSA has 32A" or 32,768 bit 
processors and one floating point processor for every 32 bit processors. 

The preliminary results obtained from the Connection Machine are shown in Table IV. We 
implemented the problem in the Fortran 8X language with array notation extensions and two dimension 
data structure. Also the CM-2 directives CMFSLAYOUT and CMFSALIGN overlay different size of 
arrays, in order to reduce the internal communication time and hence improve the performance. The 
program is compiled with — O option and rim with at most 32 K data processors in single precision. The 
preliminary results on the CM-2 indicate that when the problem size increases, the Connection Machine 
computes the problem with relatively small increase in the execution time. For instance, for the case nx=4 
states, when mesh size per state increases from M=8 to M=16, the execution time increases by about 2.1 
times for Real Time (sum of CM-2 time and the time on the front-end computer) and 5.8 times for CM 
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Time in Table IV, while on the Cray the time increases by about 23.8 times, according to Table III, which 
is much larger than for the CM-2. 

It must be noted that the Cray and the CM-2 have different computational structures and our current 
Fortran 8X program is translated from our Cray algorithm and data structure. A further goal will be to 
modify the algorithm and data structure so that the performance on the CM-2 will be competitive with the 
performance on the Crays in an absolute sense. 

Computation for Boundary Points 

The computation of the solution gradient DV and the array of second derivatives DDV, which is 
carried out in the subroutine GETDV, requires different algorithms for the interior nodes and the boundary 
nodes. Due to the complexity and generality of the underlying stochastic dynamical system, the boundary 
values cannot be specified in general, but must be calculated from the Bellman Equation (6) itself, except 
for the most trivial boundaries and processes. Use of the Bellman Equation at the boundaries, makes the 
algorithm segments for updating the boundary values quite different from the interior values in order to 
maintain the same order of error as at the interior points, i.e., to avoid numerical pollution of the order, 
O(DX) 2 , at the interior points. When the vector data structure is used, the boundary nodes (js(is) = 1 
and M ) are scattered throughout the data arrays FV(isJv). Due to this nonuniform distribution of the 
boundary nodes, a time-consuming nested if-then-else loop has to be used in the original GETDV, which 
greatly degrades the computation speed. For the current testing code with m = 3 and M = 16, GETDV 
takes 34% of the running time in the Affiant runs and 30% in the Cray X-MP runs. One way to alleviate 
this degradation is by homogeneous global computation and then separate recorrection for the boundary 
points. 


Since the proportion of boundary nodes is generally small compare with internal nodes (2/A/ for 
M mesh points per state) and all of them can be extracted explicitly from the inverse vector index (19). 
Hence, we can pass the whole data array through a homogeneous computation first, taking all points to be 
internal nodes, then recorrect the boundary nodes outside the main loop. Artificial or redundant points are 
added to prevent overwriting valid data, and it will be seen the resulting small addition to the memory by 
the use of artificial points is worth the benefit in improved performance. 

Table V compares the performance for the old and new forms of GETDV for different mesh 
points M run on the CRAY X-MP. A faster run time for the new version of 1.45 times the old version and 
a saving of up to 31% of running time is exhibited. 

MEMORY REQUIREMENTS 

Since the memory requirements grows exponentially with increases in the state variable from 
Bellman’s curse of dimensionality (16), a machine with large internal memory is needed for the large state 
variable case. For the sake of a uniform comparison, all the testings were carried out on the NCSA Cray 
2S/4-128, which possesses a huge internal memory (up to 128 MW). Table VI summarizes the memory 
requirements for different test codes. Also in Table VI, the memory in words is compared to the order of 
magnitude of the Bellman’s curse of dimensionality term in (16). The approximate asymptote for large 
state dimensions m or very fine state meshes M is about 12, which gives the effective number of major 
loops of nest depth m + 1. The CPU time measurements in Table III have similar exponential growth 
characteristics to that of memory requirements. 

CONCLUSIONS 

Stochastic dynamic programming can be optimized for the a moderate and perhaps larger number of 
state variables using a vector multiprocessor. Loop collapsing using a vector data structure, compiler 
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directives making possible more efficient loop reordering, and homogeneous global computation making 
boundary value computation more efficient, all help obtain superior optimization of the stochastic dynamic 
programming code. Parallelization, vectorization, large memories, and other super computing features are 
important in solving larger state space problems In order to handle a large number of state variables, a 
large number of parallel processors with extremely large memory would be desirable, but Bellman’s curse 
of dimensionality appears to very much weakened. Computation with massively parallel processors, like the 
Connection Machine CM-2, is still preliminary, but shows promise for larger state spaces. These techniques 
are generally applicable to other vector and parallel computers. Our general code is essentially valid for 
general Markov noise in continuous time, feedback control, nonlinear control and the cheap control limit. 
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Table I: Timings on the Alliant with vector data structure 
for loop 2 with 3 states and 16 nodes per state. 


Number of 
Processors 
P 

User CPU 
Time (seconds) 
^,2 

Speedup 

Sp, 2 

Tit/Trf 

1 

83.68 


2 

56.23 

1.49 

3 


2.78 

4 

29.24 

2.86 

5 

29.20 

2.87 

6 

29.24 

2.86 

7 

28.77 

2.91 

8 

28.64 

2.92 


Table II: Timings on the Alliant with order and directive modified loops 
for loop 3 with 3 states and 16 nodes per state. 


Number of 
Processors 
P 

User CPU 
Time (seconds) 
T p , 3 

Speedup 

5p, 3 

2Wr p ,3 

Improvement 

Ratio 

Tp,2/T P J 

1 

95.32 

1.00 

0.88 


49.00 

1.95 

1.15 


34.61 

2.75 



25.74 

3.70 

1.14 


22.66 

4.21 

1.29 


18.82 

5.06 

1.55 


17.78 

5.36 

1.62 

8 

15.44 

6.17 

1.85 
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Table III: CPU time (seconds) for different state dimensions 
and different mesh sizes on the Cray 2S/4-128 with Multitasking. 


State 

Variables 

m 

Mesh Points M 

8 

16 

32 

T 

Sp 

T 

Sp 

T 

Sp 

2 

0.033 

1.60 

0.130 

3.66 

0.685 

3.54 

3 

0.104 

3.53 

1.169 

2.86 

24.626 

1.54 

4 

2.527 

3.14 

60.151 

1.73 

2338.290 

2.10 


Table IV: CPU time (seconds) for different state dimensions 
and different mesh sizes on the Connection Machine CM-2 (CM Time) 
and front-end (Real Time). 


State 

Mesh Points M 

Variables 

8 

16 

32 

m 

Real Time 

CM Time 

Real Time 

CM Time 

Real Time 

CM Time 

3 

10.97 

2.79 

21.51 

5.83 

52.53 

30.14 

4 

36.02 

11.42 

76.01 

66.41 

— 

— 


Table V: Performance comparison of the old and new forms of GETDV 
on the Cray X-MP/48 for loop 3 with 3 states. 


Number of 
Mesh Points 
M 

User CPU Time (seconds) 

Improvement 

Ratio 

Told/T n ew 

Per Cent 
Savings 

100(1 - T new /T M ) 

Old GETDV 

Told 

New GETDV 
T n ew 

8 

0.142 

0.098 

1.45 

31. 

16 

2.093 

1.523 

1.37 

27. 

24 

10.378 

7.679 

1.35 

26. 

32 

32.725 

24.234 

1.35 

26. 

40 

78.510 

58.467 

1.34 

26. 
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Table VI: Memory requirements for different state dimensions 
and different mesh sizes on the Cray 2S/4-128. 


State 

Variables 

m 

Mesh Points M 

Mesh Points M 

8 

16 

32 

8 

16 

32 

Memory (MW) 

Words /m • M m 

2 

0.13 

0.14 

0.16 

1000. 

270. 

78. 

3 

0.15 

0.28 

1.30 

98. 

23. 

13. 

4 

0.34 

3.28 

50.23 

21. 

13. 

12. 



Feedback in time dt / 

Figure 1: The stochastic multibody system with feedback. 
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